In the following exercises, we will use the data you collected in the previous session (all comments for the video “The Census” by Last Week Tonight with John Oliver). You might have to adjust the following code to use the correct file path on your computer.

comments <- readRDS("../data/LWT_Census_parsed.rds")

Next, we go through the preprocessing steps described in the slides. As a first step, we remove newline characters from the emoji-free comment strings (the TextEmojiDeleted column).

library(tidyverse)

comments <- comments %>% 
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))

Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.

library(quanteda)

toks <- comments %>% 
  pull(TextEmojiDeleted) %>% 
  char_tolower() %>% 
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         split_hyphens = TRUE, # replaces remove_hyphens in quanteda >= 2.0
         remove_url = TRUE)

comments_dfm <- dfm(toks) %>% 
  dfm_remove(pattern = quanteda::stopwords("english")) # the remove argument of dfm() is defunct in quanteda >= 3.0


What are the 20 most frequently used words in the comments on the video “The Census” by Last Week Tonight with John Oliver? Save the overall word ranking in a new object called term_freq.
You can use the function textstat_frequency() from the quanteda package to answer this question.
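A minimal sketch of what this could look like, assuming the comments_dfm object created above and quanteda >= 3.0, where textstat_frequency() has moved to the separate quanteda.textstats package (in older versions it is part of quanteda itself):

```r
# textstat_frequency() lives in quanteda.textstats for quanteda >= 3.0
library(quanteda.textstats)

# rank all features in the document-feature matrix by their total frequency
term_freq <- textstat_frequency(comments_dfm)

# inspect the 20 most frequent words
head(term_freq, 20)
```

The resulting data frame is already sorted by frequency, so the first 20 rows of term_freq answer the question.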